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Abstract 

String kernels are typically used to compare genome-scale sequences whose length makes alignment 
impractical, yet their computation is based on data structures that are either space-inefficient, or incur 
large slowdowns. We show that a number of exact string kernels, like the A:-mer kernel, the substrings ker¬ 
nels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian 
corrections, can all be computed in 0{nd) time and in o(n) bits of space in addition to the input, using 
just a rangeDistinct data structure on the Burrows-Wheeler transform of the input strings, which takes 
0(d) time per element in its output. The same bounds hold for a number of measures of compositional 
complexity based on multiple value of k, like the fc-mer profile and the fe-th order empirical entropy, and 
for calibrating the value of k using the data. 


I Introduction 

Given two strings and T^, a kernel is a function that simultaneously converts and into vectors 
and in R" for some n > 0, and computes a similarity or a distance measure between and T^, without 
building and storing T* explicitly M- Kernels are often the method of choice for comparing extremely 
long strings, like genomes, read sets, and metagenomic samples, whose size makes alignment infeasible, 
yet their computation is typically based on space-inefficient data structures, like (truncated) suffix trees, 
or on space-efficient data structures with large slowdowns, like compressed suffix trees (see e.g. [H [3] and 
references therein). The (possibly infinite) dimensions of T® are, for example, all strings of a specific family 
on the alphabet of and T^, and the value assigned to vector T® along dimension W corresponds to 
the number of occurrences of string W in T®, often rescaled and corrected in domain-specific ways. T® 
is often called composition vector, and a large number of its components can be zero in practice. In this 
paper we focus on space- and time-efficient algorithms for computing the cosine of the angle between two 
composition vectors and T^, i.e. on computing the kernel k(T^,T^) = N/V€ [—1..1], where 
N = J2w T^[IF]T^[IT] and I?® = J2w T®[IT]^. This measure of similarity can be converted into a distance 
d(T^, T^) = (1 — k(T^, T^))/2 G [0..1], and the algorithms we describe can be applied to compute norms of 
vector , like the p-norm and the infinity norm. When and are bitvectors, we are more interested 

in interpreting them as sets and in computing the Jaccard distance J(T^,T^) = ||T^ A T^||/||T^ V T^|| = 
||T^ A T^||/(||T^|| -I- ||T^|| -I- ||T^ A T^ll), where A and V are the bitwise AND and OR operators, and where 

II • 11 measures the number of ones in a bitvector. 

Given a data structure that supports rangeDistinct queries on the Burrows-Wheeler transform of each 
string in input, we show that a number of popular string kernels, like the fc-mer kernel, the substrings 
kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian 

*This work was partially supported by Academy of Finland under grant 250345 (Center of Excellence in Cancer Genetics 
Research). 


1 




corrections, can all be computed in 0{nd) time and in o(n) bits of space in addition to the input, all in 
a single pass over the BWTs of the input strings, where d is the time taken by the rangeDistinct query 
per element in its output. The same bounds hold for computing a number of measures of compositional 
complexity for multiple values of k at the same time, like the fc-mer profile and the fc-th order empirical 
entropy, and for choosing the value of k used in k-mer kernels from the data. All these algorithms become 
0{n) using the rangeDistinct data structure described in [3], and concatenating this setup to the BWT 
construction algorithm described in [3] , we can compute all such kernels and complexity measures from the 
input strings in randomized 0(n) time and in 0(n log ct) bits of space in addition to the input. Finally, we 
show that measures of expectation based on Markov models are related to the left and right extensions of 
maximal repeats. 


2 Preliminaries 

2.1 Strings 

Let S = [l..cr] be an integer alphabet, let ff = 0, ffi = —1 and #2 = —2 be distinct separators not in E, 
and let T = be a string. We assume a G o{^/n/ \ogn) throughout the paper. A k-mer is any 

string W G [1--0'] of length k > 0. We denote by friW) the number of (possibly overlapping) occurrences of 
a string W in the circular version of T, and we use the shorthand pt{W) = fT{W)/{n — \ W\) to denote an 
approximation of the empirical probability of observing W in T, assuming that all positions of T except the 
last |IT| ones are equally probable starting positions for W. A repeat IF is a string that satisfies friW) > 1. 
We denote by Ey(lF) the set of characters {a G [0..cr] : /r(alF) > 0} and by Ey(lF) the set of characters 
{b G [0..cr] : friWb) > 0}. A repeat W is right-maximal (respectively, left-maximal) iff \Y,'f{yV)\ > 1 
(respectively, iff |Ey(lF)| > 1). It is well known that T can have at most n— 1 right-maximal substrings and 
at most n — 1 left-maximal substrings. A maximal repeat of T is a repeat that is both left- and right-maximal. 

For reasons of space we assume the reader to be familiar with the notion of suffix tree STy of a string 
T, and with the notion of generalized suffix tree of two strings, which we do not define here. We denote by 
£(y) the string label of a node u in a suffix tree. It is well known that a substring W of T is right-maximal 
iff IF = i{v) for some internal node v of STy. We assume the reader to be familiar with the notion of suffix 
link connecting a node v with £(v) = aW for some a G [0..cr] to a node w with £(w) = IF: we say that 
w = suf f ixLink(z;) in this case. Here we just recall that suffix links and internal nodes of ST^ form a tree, 
called the suffix-link tree of T and denoted by SkTy, and that inverting the direction of all suffix links yields 
the so-called explicit IFemer links. Given an internal node v and a symbol a G [0..tT], it might happen that 
string a£(v) does occur in T, but that it is not right-maximal, i.e. it is not the label of any internal node of 
STy: all such left extensions of internal nodes that end in the middle of an edge are called implicit IFemer 
links. An internal node v of ST^ can have more than one outgoing Weiner link, and all such Weiner links 
have distinct labels: in this case, £{v) is a maximal repeat. It is known that the number of suffix links (or, 
equivalently, of explicit Weiner links) is upper-bounded by 2n — 2, and that the number of implicit Weiner 
links can be upper-bounded by 2n — 2 as well. 

2.2 Enumerating right-maximal substrings and maximal repeats 

For reasons of space we assume the reader to be familiar with the notion and uses of the Burrows-Wheeler 
transform of T, including the C array, the rank function, and backward searching. In this paper we use 
BWTr to denote the BWT of T, we use range(IF) = [sp(IF)..ep(lF)] to denote the lexicographic interval 
of a string IF in a BWT that is implicit from the context, and we use E^^ to denote the set of distinct 
characters that occur inside interval [i..j] of a string that is implicit from the context. We also denote by 
rangeDistinct(f, j) the function that returns the set of tuples {(c, rank(c, Pc), rank(c, Qc)) : c G where 

Pc and Qc are the first and the last occurrence of c inside interval {i..j], respectively. Here we focus on a 
specific application of BIA/Tr: enumerating all the right-maximal substrings of T, or equivalently all the 
internal nodes of ST^. In particular, we use the algorithm described in [3] (Section 4.1), which we sketch 
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here for completeness. 

Given a substring of T, let &i < &2 < • ■ • < be the sorted sequence of all the distinct char¬ 
acters in 17{W), and let ai, 02 ,..., a^, be the list of all the characters in E^(Vh), not necessarily sorted. 
Assume that we represent a substring bb of T as a pair repr(iy) = (chars[l..fc], f irst[l..fc -|- 1]), where 
chars[i] = bi, ra.n.ge{Wbi) = [first[z]..f irst[i -|- 1] — 1] for i G [l..fc], and range() refers to BWTt. Note 
that range(fy) = [first[l]..first[fc -|- 1] — 1], since it coincides with the concatenation of the intervals of 
the right extensions of W in lexicographic order. If W is not right-maximal, array chars in repr(bb) has 
length one. Given a data structure that supports rangeDistinct queries on BVl/Tr, and given the C array 
of T, there is an algorithm that converts repr(II^) into the sequence ai,... ,ah and into the corresponding 
sequence repr(aibb),..., TepT{ahW), in 0(de) time and 0(u^ logn) bits of space in addition to the input 
and the output [5], where d is the time taken by the rangeDistinct operation per element in its output, 
and e is the number of distinct strings aiWbj that occur in the circular version of T, where i G [I../ 1 ] and 
j G [l..fc]. We encapsulate this algorithm into a function that we call extendLeft. 

If GiW is right-maximal, i.e. if array chars in repr(aiW) has length greater than one, we push pair 
(repr(aiW), \W\ + 1) onto a stack S. In the next iteration we pop the representation of a string from the 
stack and we repeat the process, until the stack itself becomes empty. This process is equivalent to following 
all the explicit Weiner links from the node v of STy with i{v) = W, not necessarily in lexicographic order. 

Thus, running the algorithm from a stack initialized with repr(e) is equivalent to performing a depth-first 
(but not necessarily a preorder) traversal of the suffix-link tree of T, which guarantees to enumerate all the 
right-maximal substrings of T. Every operation performed by the algorithm can be charged to a distinct 
node or Weiner link of ST^, thus the algorithm runs in 0{nd) time. The depth of the stack is O(logn) 
rather than 0{n), since at every iteration we push the pair (repr(aitT), |aiIT|) with largest rELnge(ailT) 
first. Every suffix-link tree level in the stack contains at most a pairs, and each pair takes at most crlogn 
bits of space, thus the total space used by the stack is 0 (ct^ log^ n) bits. The following theorem follows from 
our assumption that tr G o(i/n/logn): 

Theorem 1 ([3])- Let T G [l-.u]" ® string. Given a data structure that supports rangeDistinct 

queries on BWTy, we can enumerate all the right-maximal substrings W ofT, and for each of them we can 
return \W\, repr(IT), the sequence 01 , 02 , ■■■ ,ah of all characters in Ey(IT) (not necessarily sorted), and 
the sequence repr(ailT),..., repr(aft,IT), in 0(nd) time and in o(n) bits of space in addition to the input 
and the output, where d is the time taken by the rangeDistinct operation per element in its output. 

Theorem [I] does not specify the order in which the right-maximal substrings must be enumerated, nor 
the order in which the left extensions of a right-maximal substring must be returned. The algorithm we just 
described can be adapted to return all the maximal repeats of T, with the same bounds, by outputting a 
right-maximal string W iff |rangeDistinct(sp(IT), ep(IE))| > 1. A version of the same algorithm can also 
enumerate all the internal nodes of the generalized suffix tree of two string and T^, using BWTt'i and 
BWT 7 ’ 2 : in this case, a string W is represented as a quadruple repr'(IE) = (charsi[l..fci], f irsti[l..fci -|- 
1], chars 2 [l..fc 2 ], f irst 2 [l..fc 2 + 1]), and we assume that firsti[I] = 0 iff IT does not occur in Tb We call 
extendLeft' the function that maps repr'(IT) to the list of its left extensions repr'(ailT). 

Theorem 2 ([5]). Let G and G [l..cr ]"^“^#2 be two strings. Given two data struc¬ 

tures that support rangeDistinct queries on BIA/Tyi and on BWT 7 ’ 2 , respectively, we can enumerate all the 
right-maximal substrings W of T = , and for each of them we can return |IT|, repr'(IT), the sequence 

01 , 02 ,... ,ah of all characters in (IT) (not necessarily sorted), and the sequence repr'(ailT),..., repr'(aft,IT), 

in 0(nd) time and in o(n) bits of space in addition to the input and the output, where n = ni -\- n 2 and d is 
the time taken by the rangeDistinct operation per element in its output. 

For reasons of space, we assume throughout the paper that d is the time per element in the output of a 
rangeDistinct data structure that is implicit from the context. We also replace T* by i in subscripts, or 
we waive subscripts completely whenever they are clear from the context. 
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3 Kernels and complexity measures on k-mers 

Given a string T G and a length k > 0, let vector = [l..cr^] be such that Tfe[bh] = friW) for 

every W G [l..cr]^. The k-mer complexity Ck{T) of string T is the number of nonzero components of T^. 
The k-mer kernel of two strings and is k(T^, T^). Recall that Theorem [T] and [5] enumerate all nodes 
of a suffix tree in no specific order. In this section we describe algorithms to compute Cfc(T) and k(T^, T|) 
in a way that does not depend on the order in which the nodes of a suffix tree are enumerated: we can thus 
implement such algorithms on top of Theorem [T] and [2] The main idea behind our approach is a telescoping 
strategy that works by adding and subtracting terms in a sum, as described below: 

Theorem 3. Let T G be a string. Given an integer k and a data structure that supports 

rangeDistinct queries on BWTt, we can compute Ck(T) in 0(nd) time and in o(n) bits of space in addition 
to the input. 

Proof. A fc-mer of T can either be the label of a node of ST^, or it could end in the middle of an edge {u, v) 
of ST. In the latter case, we assume that the fc-mer is represented by its locus v, which might be a leaf. 
Let C/c(T) be initialized to n — k, i.e. to the number of leaves that correspond to suffixes of T of length at 
least fc + 1. We enumerate the internal nodes of ST using Theorem [1] and every time we enumerate a node 
V we proceed as follows: if 1^(1:)| < k we leave Ck{T) unaltered, otherwise we increment Cfc(T) by one and 
we decrement Cfe(T) by the number of children of v in ST, which is the length of array chars in repr(£(ti)). 
In this way, every internal node v of ST that is located at string depth at least k and that is not the locus 
of a fc-mer is both added to Cfc(T) (when the algorithm visits v) and subtracted from Ck{T) (when the 
algorithm visits parent(u)). Leaves at depth at least fc + 1 that are not the locus of a k-mei are added by 
the initialization of Cfc(T), and they are subtracted during the enumeration. Conversely, every locus u of a 
fc-mer of T (including leaves) is just added to Ck{T), since |£(parent(?;))| < k. □ 

We can apply the same telescoping strategy to compute k(TJ,, T^): 

Theorem 4. Let G [I..cr]"i“^#i and G [l..cr]”2“^#2 be strings. Given an integer k and two data 
structures that support rangeDistinct queries on BWTt^i and on BWT 7 ’ 2 , respectively, we can compute 
«:(T^,T^) in 0(nd) time and in o{n) bits of space in addition to the input, where n = ni +n 2 . 

Proof Recall that k{TI,TI) = N/y'DW^, where N = Eiv [IT]T2[W], = Ew and W G 

[I..tT]^. We initially set N = 0 and D'‘ = Ui — k, since these are the contributions of all the leaves at depth 
at least /c + I in the generalized suffix tree of and T^. Then, we enumerate every internal node u of the 
generalized suffix tree, using Theorem [2l if |£(w)| < k we keep all variables unchanged, otherwise we set N 
to iV + fi{£iu)) ■ / 2 (^(w)) - Ej, fiWv)) ■ / 2 Wt)) and we set to D* + fi{£{u)f - E„ ft{.£{v)Y, where v 
ranges over all children of u in the generalized suffix tree. Clearly fi{i{u)) = f irsti[fci +1] — f irsti[I] where 
ki is the size of array chars^ in repr'(£(it)), and fi{£{v)) = fi{£{u)bj) = firsti[j + 1] — firsti[j] for some 
j G \l..ki]. In analogy to Lemma |3l the contribution of the loci of the distinct fc-mers of T^, of T^, or of 
both, is added to the three temporary variables and never subtracted, while the contribution of every other 
node u at depth at least k in the generalized suffix tree is both added (when the algorithm visits u, or when 
N and are initialized) and subtracted (when the algorithm visits parent(u)). □ 

An even more specific notion of compositional complexity is Ckj(T), the number of distinct fc-mers that 
occur exactly / times in T. In the k-mer profiling problem 011] we are given a string T, an interval [ki..k 2 \ of 
lengths and an interval [/ 1 ../ 2 ] of frequencies, and we are asked to compute the matrix prof ile[fci..fc 2 , /i--/ 2 ] 
dehned as follows: prof ile[i, j] = Cij{T) if j < / 2 , and profile)^, j] = J2h>j ^i,h{T) if j = / 2 . Note that 
column j of profile can have nonzero cells only if fj is the frequency of some internal node of STt. In 
practice profile is often computed by running a fc-mer extraction algorithm — fci + 1 times, and by 
scanning the output of all such runs (see e.g. 0 and references therein). The following lemma shows that 
we can compute profile in just one pass over the BWT of the input string, and in linear time in the size 
of profile: 
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Theorem 5 . Let T G he a string. Given ranges [ki..k2] and [/i-./2], and given a data structure 

that supports rangeDistinct queries on B\NTt, we can compute matrix prof ±\e[ki..k2, in 0 {nd-\- 

(^2 — fci)(/2 ~ /i)) time and in o(n) bits of space in addition to the input and the output. 

Proof. We use Theorem [1] again. Assume that, for every internal node u of STt with string depth at least ki 
and with frequency at least /i, and for every k G [ki.. min{|£(u)|, fe}], we increment prof ile[fc, min{/(M), / 2 }] 
by one and we decrement prof ile[A:, min{/(u), / 2 }] by one for every child u of u in ST such that f{v) > fi. 
This would take 0{n^) total updates to profile. However, we can perform all of these updates in 
batch, as follows: for every node u of ST with /(u) > fi and with \i{u)\ > fci, we just increment 
prof ile[min{|£(it)|, ^ 2 }, min{/(u), / 2 }] by one, and we just decrement prof ile[min{|£(u)|, ^ 2 }; niin{/(u), / 2 }] 
by one for every child u of u in ST such that f(v) > f\. After having traversed all the internal nodes of 
ST, we scan profile as follows: for every j G [/i../ 2 ], we traverse all values of i in the decreasing order 
^2 — 1,..., fci, and we set prof ile[q j] = prof ile[i, j] +prof ile[i +1, j]. If /i = 1, at the end of this process 
the first column of profile contains negative numbers, since Theorem [T] does not enumerate the leaves of 
ST. Thus, before returning, we add to prof ile[q 1] the number of leaves with string depth at least ki + 1, 
i.e. value n — ki, for all i G [ki..k 2 \. □ 

A similar algorithm allows computing k(T^, T|) for all fc in a user-specified range [ki..k 2 ] in 0{nd+k2—ki) 
time. Matrix profile can be used to determine a range of values of k to be used in A:-mer kernels. The 
smallest number in this range is typically the value of k that maximizes the number of distinct fc-mers that 
occur at least twice in T [H]. The largest number in the range is typically determined using some measure 
of expectation: we cover this computation in Section [5l 

A related notion of compositional complexity is the k-th order empirical entropy of T, defined as Hfc(T) = 
(l/l^l) ■ T,wT,aGS-{w) friWa) ■ \og{fTiW)/fT(Wa)), where W ranges over all strings in [1 ..ct]'=. Clearly 
only the internal nodes of STt contribute to some Hfc(T) [3], thus our methods allow computing Hfc(T) for 
a user-specified range of lengths [ki..k2] in 0 (nd) time, using just one pass over BWTy. 


4 Kernels and complexity measures on all substrings 

Given a string T G [l..tT]"“^#, consider the infinite-dimensional vector Too indexed by all distinct substrings 
W G [l..tT] + , such that Too[kF] = friW). The substring complexity Coo{T) of T is the number of nonzero 
components of Too- The substring kernel of two strings and is the cosine of composition vectors T^ 
and T^. Computing substring complexity and substring kernel amounts to applying the same telescoping 
strategy described in Theorem |3] and |4l but with different contributions: 

Corollary 1. Let T G [l..cr]"'“^^ he a string. Given a data structure that supports rangeDistinct queries 
on EWTr, we can compute Coo(T) in 0{nd) time and in o(n) bits of space in addition to the input. 

Proof. The substring complexity of T coincides with the number of characters in that occur on all 

edges of STt- We can thus proceed as in LemmalU initializing Coo(T) to (n — l)n/2, or equivalently to the 
sum of the lengths of all suffixes of T[l..n — 1]. Whenever we visit a node v of ST, we add to Coo('T) the 
quantity |t'(u)|, and we subtract from Coo('T) the quantity \£{v) \ ■ |children(u)|. The net effect of all such 
operations coincides with summing the lengths of all edges of ST, discarding all occurrences of character ff. 
Note that \£{u)\ is provided by Theorem[Tl and |children(?;)| is the size of array chars in repr(£(u)). □ 

Corollary 2. Let G [l..cr]"i“^#i and G [l..cr ]”^“^#2 be strings. Given data structures that support 
rangeDistinct queries on BWTj’i and on BWTt 2 , respectively, we can compute k(T^,T^) in 0{nd) time 
and in o{n) bits of space in addition to the input, where n = ni + n 2 . 

Proof. We proceed as in Theorem [U setting again N — 0 and I?* = {ui — l)ni/2 at the beginning of the 
algorithm. When we visit a node u of the generalized suffix tree of and T^, we set A^ to A^ -I- \£{u) \ ■ 
(/i(£(u))/ 2 (^(u)) - and we set D* to + \t{u)\ ■ {fi{£{u)f - where v 

ranges over all children of u in the generalized suffix tree. □ 
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In a substring kernel it is common to weight a substring by a user-specified function of its length: 
typical choices are for a given constant e, or indicators that select only substrings within a specific 
range of lengths m- We denote by ^ a weighted version of the infinite-dimensional vector in which 
T^[VF] = g(|W^|) • fr'iW) and where g is any user-specified function. We assume that the number of bits 
required to represent the output of g with sufficient precision is O(logn). It is easy to adapt Corollary!^ to 
support this type of composition vector: 

Corollary 3. Let and be strings. Given a function g{k) that can be 

evaluated in constant time, and given data structures that support rangeDistinct queries on BWTt'i and 

on B\NTx 2 , respectively, we can compute in 0{nd) time and in o(n) bits of space in addition 

to the input, where n = ni + n 2 . 

Proof. We modify Corollary [2] as follows. Assume that we are processing an internal node v of the generalized 
suffix tree, let £{v) = W, and assume that we have computed repr'(aW) for all the left extensions aW of 
W. In addition to pushing repr'(aW) onto the stack, we also push value prefixSum(aW) = 
with it, where pref ixSum(aW) = pref ixSum(W) -I- g(|II^| + 1)^. When we pop repr'(aW), we compute its 
contributions to N and as described in Corollary [21 but replacing |aW| by pref ixSum(aW). We initialize 

Corollary [3] can clearly support distinct weight functions for and T^. For some functions, like 
prefix sums can be computed in closed form US], thus there is no need to push prefixSum values on the 
stack. Another frequent weighting scheme for a string W associates a score q(c) to every character c of 

W, and it weights W by e.g. q{W) = OS In this case we could just push pref ixSum(y) = 

S!=i n}=i ^(I^b])^ onto the stack, where V = aW and pref ixSum(y) = q{a)‘^ ■ (1 -|-pref ixSum(IF)). A 
similar weighting scheme can be used for k-mers as well. Let be a version of such that 'Tk,q\W] = 
/t(IF) — (|T| — \W\)q(W) for every W £ [l..cr]^, and consider the following distances defined in [13] : 

w 

^ TlJW]TljW]/ (v'(ni - k){n 2 - k) ■ q{W)) 

w 

where W ranges over all strings in [I..(t]^. We can compute such distances using just a minor modification 
to Theorem m 

Corollary 4. Let and £ [l..cr ]"^“^#2 be strings. Given an integer k and data 

structures that support rangeDistinct queries on BWTt’i and on BWTt’ 2 , respectively, we can compute 
and p,T^ p) in 0(nd) time and in A log cr-|-o(n) bits of space in addition to the input, 

where n = ni + n 2 and X is the length of the longest repeat in . 

Proof. We proceed as in Theorem [H pushing on the stack value q{W,k) = Y^j^iq{W[j]) in addition to 

repr'(IT), and maintaining a separate stack of characters to represent the string we are processing during 
the depth-first traversal of the generalized suffix-link tree. We set q{aW, k) = q{a) ■ q{W, k)/q{b), where b is 

the fcth character from the top of the character stack when we are processing W. □ 

An orthogonal way to measure the similarity between and consists in comparing the repertoire of 
all strings that do not appear in and in T^. Given a string T and two frequency thresholds ti < r 2 , a 
string VF is a minimal rare word of T if ti < friW) < T 2 and if /t(I^) > r 2 for every proper substring V 
of W. Setting ti = 0 and T 2 = 1 gives the well-known minimal absent words (see e.g. uniis] and references 
therein), whose total number can be Q{an) [5]. Setting ti = 1 and T2 = 2 gives the so-called shortest 
unique substrings (see e.g. [11] and references therein), whose total number is 0(ri), like the number of 
strings obtained by any other setting of ri > 1. In what follows we focus on minimal absent words, but our 
algorithms can be generalized to other settings of the thresholds. 
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To decide whether aWb is a minimal absent word of T, where a and b are characters, it clearly suffices 
to check whether fT{aWb) = 0 and whether both fT{o,W) > 1 and friWb) > 1. It is well known that only 
a maximal repeat of T can be the infix IT of a minimal absent word aWb, and this applies to any setting of 
Ti and T2. To enumerate all the minimal absent words, for example to count their total number C_(T), we 
can thus iterate over all nodes of ST^ associated with maximal repeats, as described below: 

Theorem 6 . Let T G be a string. Given a data structure that supports rangeDistinct queries 

on BWTr, we can compute C_(T) in 0(nd) time and in o(n) bits of space in addition to the input. 

Proof. For clarity, we first describe how to enumerate all the distinct minimal absent words of T: we specialize 
this algorithm to counting at the end of the proof. We use Theorem [1] to enumerate all nodes v of STt 
associated with maximal repeats, as described in Section 12.21 Let {ai,..., Uh} be the set of distinct left 
extensions of string i(v) in T returned by operation extendLeft(repr(u)), let extensions[l..cr + l, 0 ..cr] 
be a boolean matrix initialized to all zeros, and let leftExtensions[l..cr + 1] be an array initialized to 
all zeros. Let h' be a pointer initialized to one. Operation extendLeft allows following all the Weiner 
links from u, not necessarily in lexicographic order: for every string ai£(v) obtained in this way, we set 
leftExtensions[h'] = Oi, we enumerate its right extensions {ci,..., Cfc'} using array chars of repr(ai£(u)), 
we set extensions[h', Cj] = 1 for all j G and we finally increment h' by one. Note that only the 

columns of extensions that correspond to the right extensions of £{v) are updated by this procedure. Then, 
we enumerate all the right extensions { 6 i, ..., bk} of £{v) using array chars of repr(£(u)), and for every such 
extension bj we report all pairs (a^, bj) such that Oi = chars[a:], x G [l..h'], and extensions[a:, bj] = 0 . This 
process takes time proportional to the number of Weiner links from v, plus the number of children of v, plus 
the number of Weiner links from v multiplied by cr. When applied to all nodes of ST, this takes in total 
0{na) time, which is optimal in the size of the output. The matrices and vectors used by this process can 
be reset to all zeros after processing each node: the total time spent in such reinitializations in 0(n). 

If we just need C^{T), rather than storing the temporary matrices extensions and leftExtensions, 
we store just a number area which we initialize to hk before processing node v. Whenever we observe a 
right extension Cj of a string aii(v), we decrease area by one. Before moving to the next node, we increment 
C_(r) by area. □ 

Let T_ be the infinite-dimensional vector indexed by all distinct substrings W G [l..cr] + , such that 
T_[IT] = 1 iff IT is a minimal absent word of T. Theorem [5] can be adapted to compute the Jaccard 
distance between the composition vectors of two strings: 

Corollary 5 . Let G [l..cr]”i“^#i and G be strings. Given data structures that support 

rangeDistinct queries on BWTt’i and on BWT7'2, respectively, we can compute J(TL,T^) in 0{nd) time 
and in o(n) bits of space in addition to the input, where n = ni + n 2 . 

Proof. We apply the strategy of Theorem |6] to the internal nodes of the generalized suffix tree of and 

whose label is a maximal repeat of and a maximal repeat of T^: such strings are clearly maximal 
repeats of as well. We enumerate such nodes as described in Section [521 We keep a global variable 
intersection and a bitvector sharedRight[l..cr]. For every node v that corresponds to a maximal repeat 
of and of T^, we merge the sorted arrays charsi and chars2 of repr'(£(u)), we set sharedRight[c] = 1 
for every character c that belongs to the intersection of the two arrays, and we cumulate in a variable k' 
the number of ones in sharedRight. Then, we scan every left extension ai provided by extendLeft', we 
determine in constant time whether it occurs in both and T^, and if so we increment a variable h' by 
one. Finally, we initialize a variable area to h'k', and we process again every left extension Oi provided by 
extendLeft': if ai£{v) occurs in both and T^, we compute the union of arrays charsi and chars2 of 
repr'(ai£(z;)), and for every character c in the union such that sharedRight [c] = 1, we decrement area by 
one. At the end of this process, we add area to the global variable intersection. To compute ||TL V T^|| 
we apply Theorem [6] to and separately. □ 

It is easy to extend Corollary |S] to compute k(T 1 _,T?_), as well as to support weighting schemes based 
on the length and on the characters of minimal absent words. 
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5 Markovian corrections 


In some applications it is desirable to assign to component W G of composition vector Too an estimate 

of the statistical significance of observing friW) occurrences of Vb in T: intuitively, strings whose frequency 
departs from its expected value are more likely to carry “information”, and they should be weighted more 
m- Assume that T is generated by a Markov random process of order fc —2 or smaller, that produces strings 
on alphabet [l..cr] according to a probability distribution P. It is well known that the probability of observing 
IT in a string generated by such random process is P(IT) = ¥{W\l..k — 1]) • P(IT[2../c])/P(IT[2..fc — 1]). We 
can estimate P(IT) using the empirical probability pt( kb), obtaining the following approximation for P(IT): 
Pt{W) = pT{W[l..k — 1]) • pT{W[2..k])/pT{W[2..k — 1]) if pT{W[2..k — 1]) ^ 0, and pt{W) = 0 otherwise. 
We can thus estimate the signihcance of the event that substring W has empirical probability pxiW) in 
string T using the following score: zt{W) = {pt{W) — pt{W))/pt{W) if pt{W) 0, and zt{W) = 0 if 
Pt{W) = 0 [H]. After elementary manipulations [5], zt{W) becomes: 


zt{W) 

9{x,y) 


MW) ■ - l]) 

^ MW[i..k-i])-MW[2..k]) 

ix-y + 2fl {x-y + 1)(2: - p + 3) 


- 1 


Since g{x^ y) G [1..1.125], we temporarily assume g{x, y) = 1 in what follows, removing this assumption later. 

Let Tj be a version of the infinite-dimensional vector Too in which T 2 [IT] = zt{W). Among all strings 
that occur in T, only strings aWb such that a and b are characters in and such that IT is a maximal 

repeat of T can have Tz[aWb] 0. Similarly, among all strings that do not occur in T, only the minimal 
absent words of T have a nonzero component in T^: specihcally, T2[aIT5] = — 1 for all minimal absent words 
aWh of T, where a and b are characters in [0..cr] [5]. Given two strings and T^, we can thus compute 
using the same strategy as in Corollary[5l 

Theorem 7. Let G [1..(t]"i“^#i and G [1..(t]"^“^# 2 be strings. Given data structures that support 
rangeDistinct queries on EWT^^i and on BWT 7 ’ 2 , respectively, and assuming g{x,y) = 1 for all settings of 
X and y, we can compute k(TJ,T^) in 0{nd) time and in o{n) bits of space in addition to the input, where 
n = ni + n^. 

Proof. We focus here on computing component N of k(T^,T^): computing follows a similar algorithm 
on BWTT’i. We keep again a bitvector sharedRight[l. .cr], and we enumerate all the internal nodes of the 
generalized suffix tree of and whose label is a maximal repeat of T^T^, as described in Section [2T^ For 
every such node v, we merge the sorted arrays charsi and chars2 of repr' (£(?;)), we set sharedRight[c] = 1 
for every character c that belongs to the intersection of the two arrays, and we cumulate in a variable k' 
the number of ones in sharedRight. Then, we scan every left extension at provided by extendLeft', we 
determine in constant time whether it occurs in both and T^, and if so we increment a variable h' by 
one. Finally, we initialize a variable area to h'k' , and we process again every left extension provided 
by extendLeftb If afiiv) occurs in both and T^, we merge arrays charsi and chars2 of repr'(ai£(f)): 
for every character b in the intersection of charsi and chars2, we add to N value Zi{aii{v)b) ■ Z 2 {afi{v)b), 
retrieving the corresponding frequencies from repr'{afi{v)) and from repr'(^(i;)), and we decrement area 
by one. For every character b that occurs only in charsi, we test whether sharedRight[ 6 ] = 1: if so, OiWb 
is a minimal absent word of that occurs in T^, thus we decrement area by one and we add to N value 
—zi(ai£(v)b). We proceed symmetrically if b occurs only in chars2. At the end of this process, area counts 
the number of minimal absent words with infix £{v) that are shared by and T^: thus, we add area to 
N. □ 


It is easy to remove the assumption that g{x, y) is always equal to one. There are only two differences 
from the previous case. First, the score of the substrings W of T* that have a maximal repeat of T* as an 
infix changes, but gijii, \W\) can be immediately computed from |IF|, which is included in both repr(IF) 
and repr'(IF). Second, the score of all substrings W of T* that do not have a maximal repeat as an infix 
changes from zero to g{ni, \ W\) — 1: we can take into account all such contributions by pushing prefix-sums 



to the stack, as in Corollary [31 For example, to compute component N of «;(TJ,,T^), we can first assume 
that all substring W that occur both in and in have score g{ni,\W\) — l,hy pushing to the stack the 
prefix-sums described in [3] and by enumerating only nodes v of the generalized suffix tree of and such 
that £(v} occurs both in and in T^. Then, we can run the algorithm in TheoremjTl subtracting quantity 
{g{ni, \W\ -I- 2) — 1) • {g{n 2 , \W\ -|- 2) — 1) from the contribution to N of every string UiWb that occurs both 
in and in T^. 

Finally, recall that in Section [3] we mentioned the problem of determining an upper bound on the values 
of k to be used in k-mer kernels. Let be the composition vector indexed by all strings in [l..cr]^ such that 
Tfe[kF] = priW), and let Tk be a similar composition vector with T/c[kF] = pt{W), where priW) is defined 
as in the beginning of this section. It makes sense to disregard values of k for which and are very 
similar, and more formally whose Kullback-Leibler divergence KL(Tfc,Tfc) = ’ (log(Tfc[fF]) — 

log(Tfe[IF])) is small, where W ranges over all strings in [l..cr]^. Thus, we could use as an upper bound on k 
the minimum value k* such that KL(Tfe/, T^/) < t for some user-specified threshold r m- Note again 

that only strings aWb such that a and b are characters in [0 ..(t] and IF is a maximal repeat of T contribute to 
KL(T|iy|+ 2 i T|i 4 /|+ 2 )- We can thus adapt TheoremjTjto compute the KL divergence for a user-specified range 
of lengths [ki..k 2 ], using just one pass over BWTt, in 0(nd) time and in o(n) bits of space in addition to the 
input and the output. The same approach can be used to compute the KL-divergence kernel 
where = KLri(IF) and KL 7 ’i(IF) = Yl,a b&^PT'io,Wb) ■ (log(pTi(alF5)) — log(p 7 ’i(aIF6))). 
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